Overview of Quantization

where x is a real-valued input (an activation or a weight), S is a real-valued scaling factor, and Z is an integer zero point. The INT function converts a real number to an integer value via a rounding technique (e.g., round-to-nearest or truncation); it is simply a mapping from real values x to integer values. This method of quantization is also known as uniform quantization.
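As a concrete illustration, the uniform quantizer above can be sketched in a few lines of NumPy. This is a minimal sketch, not the exact formulation used later in the text: the function names are illustrative, INT is realized as round-to-nearest, and de-quantization is taken as the approximate inverse S(Q(x) + Z).

```python
import numpy as np

def uniform_quantize(x, S, Z):
    """Q(x) = INT(x / S) - Z, with INT realized as round-to-nearest."""
    return np.round(x / S).astype(np.int64) - Z

def uniform_dequantize(q, S, Z):
    """Approximate inverse: x_hat = S * (q + Z). Ignoring clipping, x_hat
    differs from the original x by at most half a quantization step, S/2."""
    return S * (q + Z)

x = np.array([0.10, -0.45, 0.72])
q = uniform_quantize(x, S=0.05, Z=0)        # integer codes
x_hat = uniform_dequantize(q, S=0.05, Z=0)  # real-valued reconstruction
```

Note that the reconstruction error on each entry is bounded by S/2 = 0.025, which is exactly the "granularity" that the scaling factor controls.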

Non-uniform quantization methods, by contrast, produce quantized values that are not necessarily uniformly spaced. Non-uniform quantization is formally defined as

\[
Q(x) =
\begin{cases}
q_1, & \text{if } x \le \Delta_1, \\
\;\;\vdots \\
q_i, & \text{if } \Delta_{i-1} < x \le \Delta_i, \\
\;\;\vdots \\
q_U, & \text{if } x > \Delta_U,
\end{cases}
\tag{2.4}
\]

where q_i represents the discrete quantization levels and Δ_i denotes the quantization steps. When the value of a real number x falls between the quantization steps Δ_{i-1} and Δ_i, the quantizer Q projects it to the associated quantization level q_i. It should be noted that neither the q_i nor the Δ_i need be evenly spaced.
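The piecewise rule in Eq. (2.4) can be implemented directly with a sorted search over the step boundaries. In this sketch the steps and levels are hand-picked purely for illustration (in practice they come from a rule or an optimization, as discussed next), with four levels separated by three interior boundaries; note that neither array is evenly spaced.

```python
import numpy as np

# Hand-picked, unevenly spaced boundaries Delta_1..Delta_3 and levels q_1..q_4.
steps = np.array([-0.5, 0.0, 0.8])
levels = np.array([-1.2, -0.1, 0.3, 1.5])

def nonuniform_quantize(x):
    """Return q_i for each x satisfying Delta_{i-1} < x <= Delta_i."""
    # searchsorted(..., side="left") counts the boundaries strictly below x,
    # so x <= Delta_1 maps to index 0 and x > Delta_3 maps to index 3.
    return levels[np.searchsorted(steps, x, side="left")]

q = nonuniform_quantize(np.array([-2.0, 0.0, 0.5, 2.0]))
```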

Non-uniform quantization can achieve higher accuracy for a fixed bit-width because it captures the underlying distributions more faithfully, either by focusing bits on important value regions or by determining appropriate dynamic ranges. For example, various non-uniform quantization

techniques have been developed for bell-shaped distributions of weights and activations,

which often exhibit long tails. A commonly employed rule-based nonuniform quantization

method uses a logarithmic distribution, where the quantization steps and levels increase

exponentially rather than linearly.
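A minimal sketch of such logarithmic (power-of-two) quantization follows; the exponent range [2^-4, 2^0] used for clipping is an illustrative assumption, not a prescribed choice.

```python
import numpy as np

def log2_quantize(x, min_exp=-4, max_exp=0):
    """Round the exponent log2|x| to the nearest integer, so the levels 2^k
    (and the steps between them) grow exponentially rather than linearly."""
    sign = np.sign(x)                                   # handle the sign separately
    mag = np.clip(np.abs(x), 2.0**min_exp, 2.0**max_exp)
    exp = np.round(np.log2(mag))                        # nearest integer exponent
    return sign * 2.0**exp

q = log2_quantize(np.array([0.9, 0.3, -0.04, 0.0]))
```

Because the levels cluster near zero, this scheme matches the long-tailed, bell-shaped weight distributions mentioned above better than an evenly spaced grid of the same size.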

Recent advances have approached non-uniform quantization as an optimization problem. The goal is to minimize the difference between the original tensor x and its quantized counterpart Q(x) by adjusting the quantization steps/levels of the quantizer:

\[
\min_{Q} \; \| Q(x) - x \|_2^2 . \tag{2.5}
\]

Nonuniform quantization can also be improved by making the quantizer itself trainable.

These methods are called learnable quantizers, and the quantization steps/levels are opti-

mized through an iterative process or gradient descent along with the model parameters.

Overall, non-uniform quantization can represent data better by distributing bits and discretizing the range of parameters unevenly. However, this type of quantization can be challenging to implement efficiently on standard computation hardware such as GPUs and CPUs. As a result, uniform quantization remains the prevalent method because of its

straightforward implementation and efficient mapping to hardware.

2.1.2 Symmetric and Asymmetric Quantization

The choice of the scaling factor, S, in Eq. 2 is crucial in uniform quantization. S determines

the size of each partition by dividing the range of real values, x, into a specified number of

segments. The value of S affects the granularity of the quantization and ultimately impacts

the accuracy of the quantized representation:

\[
S = \frac{\beta - \alpha}{2^b - 1}, \tag{2.6}
\]

where [α, β] is the clip range and b is the bit-width. The clipping range, [α, β], determines

the range of real values that should be quantized. The choice of this range is crucial, as

it determines the quantization’s precision and the quantized model’s overall quality. This